Since the data is so large, we prepare some filters and computations in advance and load the processed data.
The filters applied to the data are:
Within the crime data, we filter out major property crimes (robberies, burglaries, larceny, or theft) and violent crimes (assaults, homicides, shootings).
The thefts and violent crimes data is then aggregated spatially in two ways:
# # Aggregate crime data spatially in two ways:
# # 1. Count number of crimes within 150 meters of each parcel
# # 2. Compute kernel density estimation of crimes within 150 meters of each
# # parcel
# # Filter major property crimes and violent crimes
# stealin <- z[grep("ROBBERY|BURGLARY|LARCENY|THEFT",
# z$UCR_1_Category), ]
# hurtin <- z[grep("ASSAULT|HOMICIDE|SHOOTING", z$UCR_1_Category), ]
# # Get number of thefts and violent crimes within 150 meters of each parcel
# p$Thefts <- st_is_within_distance(st_transform(p, crs = 26956),
# st_transform(stealin, crs = 26956),
# dist = units::set_units(150, "m")) |>
# sapply(length)
# p$Violence <- st_is_within_distance(st_transform(p, crs = 26956),
# st_transform(hurtin, crs = 26956),
# dist = units::set_units(150, "m")) |>
# sapply(length)
# # Compute KDE of thefts and violent crimes
# # Bandwidth selection using cross-validation (Hscv)
# p_proj <- p |> st_transform(crs = 26956) |> st_centroid() |> st_coordinates()
# v_proj <- hurtin |> st_transform(crs = 26956) |> st_coordinates()
# p$Violence_KDE <- kde(v_proj, Hscv(v_proj)) |> predict(x = p_proj)
# t_proj <- stealin |> st_transform(crs = 26956) |> st_coordinates()
# p$Thefts_KDE <- kde(t_proj, Hscv(t_proj)) |> predict(x = p_proj)
#
# # Save parcel data with computed crime statistics
# st_write(p, ".//data//parcel_hartford_single_family_crimestats.geojson")
# Read in filtered Hartford crime and parcel data
z <- st_read(".//data//crime_hartford_2016_2021.geojson")
## Reading layer `crime_hartford_2016_2021' from data source
## `C:\Users\llint\OneDrive - Yale University\classes\625\CT Property\SDS625\data\crime_hartford_2016_2021.geojson'
## using driver `GeoJSON'
## Simple feature collection with 197060 features and 12 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -72.71865 ymin: 41.72403 xmax: -72.65041 ymax: 41.80719
## Geodetic CRS: WGS 84
p <- st_read(".//data//parcel_hartford_single_family_crimestats.geojson")
## Reading layer `parcel_hartford_single_family_crimestats' from data source
## `C:\Users\llint\OneDrive - Yale University\classes\625\CT Property\SDS625\data\parcel_hartford_single_family_crimestats.geojson'
## using driver `GeoJSON'
## Simple feature collection with 7239 features and 51 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -72.71637 ymin: 41.72374 xmax: -72.65809 ymax: 41.80743
## Geodetic CRS: WGS 84
# Get polygons for Hartford's census tracts
hartford_tracts <- st_filter(tracts(state = "CT"),
subset(county_subdivisions(state = "CT"),
NAMELSAD == "Hartford town"),
.predicate = st_within) |>
st_transform(crs = st_crs(z))
## Retrieving data for the year 2021
## Retrieving data for the year 2021
# Re-order condition description of parcels
p$Condition_Description <- factor(
as.character(p$Condition_Description),
levels = c("Dilapidated", "Very Poor", "Poor", "Fair", "Fair-Avg", "Average",
"Avg-Good", "Good", "Good-VG", "Very Good", "Excellent"))
# Select predictors and filter out NA's
X <- p[, c("OBJECTID", "Assessed_Total",
"Thefts", "Violence", "Thefts_KDE", "Violence_KDE",
"Living_Area", "Effective_Area", "AYB",
"Number_of_Bedroom", "Number_of_Baths",
"Condition_Description")] %>%
rename(Price = Assessed_Total, Year = AYB,
Bed = Number_of_Bedroom, Bath = Number_of_Baths,
Condition = Condition_Description) %>%
na.omit()
# Check that we haven't dropped too many rows
nrow(p) # 7239 rows
## [1] 7239
nrow(X) # 7220 rows
## [1] 7220
The manually curated variables are:
Price: Assessed total value of the parcelThefts: Number of thefts within 150 meters of the
parcelViolence: Number of violent crimes within 150 meters of
the parcelThefts_KDE: Kernel density estimation of theftsViolence_KDE: Kernel density estimation of
violenceLiving_Area: Living area of the parcelEffective_Area: Effective area of the parcelYear: Approximate year the parcel was builtBed: Number of bedrooms in the parcelBath: Number of bathrooms in the parcelThe highly correlated numeric covariates are
Overall, there does not seem to be a strong correlation between A couple numeric covariates appear to be correlated with the condition of the parcel:
dilapidated condition.
The median thefts near parcels in very poor condition calculated by KDE
is relatively higher than the median calculated by counts.All covariates appear to correlate with the response variable,
Assessed_Total.
# Plot pairwise correlation between numeric predictors
ggpairs(X, columns = c(2:10), lower = list(continuous = "points"),
progress = F) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Plot boxplots of numeric predictors with respect to condition description
# Pivot data longer so that we can facet by variable
LX <- X %>%
pivot_longer(cols = c(Thefts, Violence, Thefts_KDE, Violence_KDE,
Living_Area, Effective_Area,
Year, Bed, Bath, Price),
names_to = "variable", values_to = "value")
ggplot(data = LX, aes(x = variable, y = value)) +
geom_boxplot(aes(fill = Condition), position = position_dodge(1)) +
facet_wrap( ~ variable, scales = "free", ) +
theme(axis.title.x = element_blank(), axis.text.x = element_blank(),
axis.ticks.x = element_blank(), panel.grid.major.x = element_blank(),
axis.title.y = element_blank())
We first consider the Thefts and Violence
variables computed by counting crime occurrences within a 150-meter
radius of a parcel.
Variable Selection. We leave out Thefts since
it is highly correlated with Violence but has a lower
correlation with Price than Violence does. For
similar reasons, we leave out Effective_Area and keep
Living_Area. We keep both Bed and
Bath for now.
Transformations. From the correlation plots, we can see that
Price, Living_Area, and Violence
are right-skewed. To adjust for this, we take the log of
Price and the square root of Living_Area and
Violence. The residual plots look much better for the
transformed model.
# Fit linear models with no transformations
m0 <- lm(Price ~ Violence + Living_Area + Year + Bed + Bath + Condition,
data = X)
summary(m0)
##
## Call:
## lm(formula = Price ~ Violence + Living_Area + Year + Bed + Bath +
## Condition, data = X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -75062 -7023 856 6819 110854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.458e+05 1.440e+04 -17.063 < 2e-16 ***
## Violence -2.521e+02 6.990e+00 -36.065 < 2e-16 ***
## Living_Area 3.349e+01 3.255e-01 102.893 < 2e-16 ***
## Year 1.132e+02 6.656e+00 17.002 < 2e-16 ***
## Bed -2.306e+03 2.293e+02 -10.061 < 2e-16 ***
## Bath 5.426e+03 3.556e+02 15.258 < 2e-16 ***
## ConditionVery Poor 6.029e+02 8.696e+03 0.069 0.9447
## ConditionPoor 1.345e+04 6.813e+03 1.974 0.0484 *
## ConditionFair 2.685e+04 6.408e+03 4.190 2.82e-05 ***
## ConditionFair-Avg 3.280e+04 6.258e+03 5.241 1.65e-07 ***
## ConditionAverage 4.180e+04 6.163e+03 6.784 1.27e-11 ***
## ConditionAvg-Good 4.936e+04 6.158e+03 8.017 1.26e-15 ***
## ConditionGood 5.025e+04 6.157e+03 8.161 3.90e-16 ***
## ConditionGood-VG 5.442e+04 6.176e+03 8.810 < 2e-16 ***
## ConditionVery Good 5.704e+04 6.179e+03 9.231 < 2e-16 ***
## ConditionExcellent 6.455e+04 6.330e+03 10.197 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13740 on 7204 degrees of freedom
## Multiple R-squared: 0.8252, Adjusted R-squared: 0.8248
## F-statistic: 2267 on 15 and 7204 DF, p-value: < 2.2e-16
# Plot diagnostics
par(mfrow = c(2, 2))
plot(m0)
# Fit linear models with transformations
m1 <- lm(log(Price) ~ sqrt(Violence) + sqrt(Living_Area) + Year +
Bed + Bath + Condition,
data = X)
summary(m1)
##
## Call:
## lm(formula = log(Price) ~ sqrt(Violence) + sqrt(Living_Area) +
## Year + Bed + Bath + Condition, data = X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27511 -0.07695 0.01528 0.09805 0.72980
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.884e+00 1.771e-01 38.860 < 2e-16 ***
## sqrt(Violence) -4.028e-02 9.175e-04 -43.908 < 2e-16 ***
## sqrt(Living_Area) 2.823e-02 3.803e-04 74.234 < 2e-16 ***
## Year 9.114e-04 8.078e-05 11.282 < 2e-16 ***
## Bed -9.644e-03 2.818e-03 -3.422 0.000626 ***
## Bath 4.061e-02 4.161e-03 9.759 < 2e-16 ***
## ConditionVery Poor 7.605e-01 1.045e-01 7.279 3.71e-13 ***
## ConditionPoor 9.360e-01 8.185e-02 11.436 < 2e-16 ***
## ConditionFair 1.068e+00 7.699e-02 13.875 < 2e-16 ***
## ConditionFair-Avg 1.131e+00 7.519e-02 15.043 < 2e-16 ***
## ConditionAverage 1.377e+00 7.404e-02 18.603 < 2e-16 ***
## ConditionAvg-Good 1.500e+00 7.398e-02 20.271 < 2e-16 ***
## ConditionGood 1.517e+00 7.398e-02 20.507 < 2e-16 ***
## ConditionGood-VG 1.552e+00 7.421e-02 20.911 < 2e-16 ***
## ConditionVery Good 1.580e+00 7.424e-02 21.277 < 2e-16 ***
## ConditionExcellent 1.650e+00 7.605e-02 21.702 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1651 on 7204 degrees of freedom
## Multiple R-squared: 0.7622, Adjusted R-squared: 0.7617
## F-statistic: 1540 on 15 and 7204 DF, p-value: < 2.2e-16
plot(m1)
There are 31 parcels beyond 4 standard errors below the regression line. This means the model is severely overestimating the value of these parcels. Let’s find out what these parcels are.
Average condition, although the majority of them are
Average or worse.Solutions. We try the following fixes:
Log transform living area: Refer to the previous correlation
plots and observe that Price and Living Area appear to have a strongly
correlated linear correlation. However, we performed a log
transformation on the Price variable but a square-root transformation on
the Living Area variable. This asymmetry may be the cause of the large
residuals.
Drop number of bedrooms: The number of bedrooms is highly correlated with the residuals, and we might infer that most of the information provided by this variable is already captured in the number of bathrooms.
Drop the 19 parcels: These parcels are in a small geographic area and may be subject to some unobserved spatial effect.
# Investigate the large negative residuals
r <- residuals(m1) / summary(m1)$sigma # standard residuals
o <- X[r < -4, ]
o$StdResid <- r[r < -4]
nrow(o)
## [1] 31
# Add a tooltip label for parcels
o <- o %>% mutate(label = paste0('Price: ', round(Price), '<br>',
'Year: ', Year, '<br>',
'Condition: ', Condition, '<br>',
'Living Area: ', Living_Area, '<br>',
'Bedrooms: ', Bed, '<br>',
'Bathrooms: ', Bath, '<br>',
'Violence: ', Violence
))
# Plot the offending parcels geographically
g <- ggplot(o) +
geom_sf(data = hartford_tracts, aes(text = NAME)) +
geom_sf(data = o,
aes(text = label)) +
stat_sf_coordinates(size = 1, color = "red") +
labs(x = "Latitude", y = "Longitude") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
toWebGL(ggplotly(g))
# Plot residuals vs. Condition
# Function for number of observations
give.n <- function(x){
return(data.frame(
y = quantile(x, .75) + 0.1,
label = paste0("n = ", length(x))
))
}
ggplot(data = o, aes(x = Condition, y = StdResid)) +
geom_boxplot() +
stat_summary(fun.data = give.n, geom = "text", fun.y = median) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme_bw()
# Plot residuals vs transformed numeric covariates
o <- o %>%
mutate(logPrice = log(Price),
sqrtViolence = sqrt(Violence),
sqrtLiving_Area = sqrt(Living_Area))
ggpairs(o, columns = c("StdResid", "sqrtViolence", "sqrtLiving_Area",
"Year", "Bed", "Bath"),
lower = list(continuous = "points"),
progress = F) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Let’s test our hypotheses.
Living_Area, the residuals are
more symmetrically distributed. We may have sacrificed some upper tail
normality for the lower tail, since there appear to be more residuals
larger than 4 now. However, we have reduced the total number of
residuals beyond 4 SE’s from 33 to 21. Notice that the Beds
variable is no longer statistically significant.# Take log of living area
m2 <- lm(log(Price) ~ sqrt(Violence) + log(Living_Area) + Year +
Bed + Bath + Condition,
data = X)
summary(m2)
##
## Call:
## lm(formula = log(Price) ~ sqrt(Violence) + log(Living_Area) +
## Year + Bed + Bath + Condition, data = X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.28251 -0.08783 0.01367 0.10248 1.09006
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.174e+00 2.016e-01 20.709 < 2e-16 ***
## sqrt(Violence) -4.364e-02 9.532e-04 -45.777 < 2e-16 ***
## log(Living_Area) 5.548e-01 8.271e-03 67.070 < 2e-16 ***
## Year 7.687e-04 8.401e-05 9.150 < 2e-16 ***
## Bed -8.052e-04 2.920e-03 -0.276 0.783
## Bath 7.505e-02 4.174e-03 17.982 < 2e-16 ***
## ConditionVery Poor 7.686e-01 1.089e-01 7.059 1.84e-12 ***
## ConditionPoor 9.315e-01 8.531e-02 10.919 < 2e-16 ***
## ConditionFair 1.068e+00 8.025e-02 13.305 < 2e-16 ***
## ConditionFair-Avg 1.125e+00 7.838e-02 14.354 < 2e-16 ***
## ConditionAverage 1.365e+00 7.718e-02 17.691 < 2e-16 ***
## ConditionAvg-Good 1.485e+00 7.711e-02 19.252 < 2e-16 ***
## ConditionGood 1.500e+00 7.711e-02 19.450 < 2e-16 ***
## ConditionGood-VG 1.536e+00 7.735e-02 19.863 < 2e-16 ***
## ConditionVery Good 1.567e+00 7.738e-02 20.250 < 2e-16 ***
## ConditionExcellent 1.642e+00 7.927e-02 20.718 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1721 on 7204 degrees of freedom
## Multiple R-squared: 0.7417, Adjusted R-squared: 0.7411
## F-statistic: 1379 on 15 and 7204 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(m2)
# Drop number of bedrooms
m3 <- lm(log(Price) ~ sqrt(Violence) + log(Living_Area) + Year +
Bath + Condition,
data = X)
summary(m3)
##
## Call:
## lm(formula = log(Price) ~ sqrt(Violence) + log(Living_Area) +
## Year + Bath + Condition, data = X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.28246 -0.08786 0.01381 0.10273 1.08541
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.180e+00 2.005e-01 20.844 < 2e-16 ***
## sqrt(Violence) -4.366e-02 9.488e-04 -46.017 < 2e-16 ***
## log(Living_Area) 5.535e-01 6.998e-03 79.099 < 2e-16 ***
## Year 7.694e-04 8.396e-05 9.164 < 2e-16 ***
## Bath 7.485e-02 4.105e-03 18.231 < 2e-16 ***
## ConditionVery Poor 7.685e-01 1.089e-01 7.058 1.84e-12 ***
## ConditionPoor 9.311e-01 8.530e-02 10.916 < 2e-16 ***
## ConditionFair 1.067e+00 8.023e-02 13.303 < 2e-16 ***
## ConditionFair-Avg 1.125e+00 7.836e-02 14.352 < 2e-16 ***
## ConditionAverage 1.365e+00 7.716e-02 17.690 < 2e-16 ***
## ConditionAvg-Good 1.484e+00 7.710e-02 19.252 < 2e-16 ***
## ConditionGood 1.499e+00 7.710e-02 19.449 < 2e-16 ***
## ConditionGood-VG 1.536e+00 7.733e-02 19.863 < 2e-16 ***
## ConditionVery Good 1.567e+00 7.737e-02 20.249 < 2e-16 ***
## ConditionExcellent 1.642e+00 7.926e-02 20.718 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.172 on 7205 degrees of freedom
## Multiple R-squared: 0.7416, Adjusted R-squared: 0.7411
## F-statistic: 1477 on 14 and 7205 DF, p-value: < 2.2e-16
plot(m3)
# Drop the 19 outlier parcels
XO <- X[!X$OBJECTID %in% o$OBJECTID, ]
m4 <- lm(log(Price) ~ sqrt(Violence) + log(Living_Area) + Year +
Bath + Condition,
data = XO)
summary(m4)
##
## Call:
## lm(formula = log(Price) ~ sqrt(Violence) + log(Living_Area) +
## Year + Bath + Condition, data = XO)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.63568 -0.08865 0.01138 0.09956 1.04299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.468e+00 1.920e-01 23.274 < 2e-16 ***
## sqrt(Violence) -4.375e-02 9.059e-04 -48.297 < 2e-16 ***
## log(Living_Area) 5.430e-01 6.697e-03 81.076 < 2e-16 ***
## Year 6.595e-04 8.042e-05 8.202 2.78e-16 ***
## Bath 7.561e-02 3.916e-03 19.311 < 2e-16 ***
## ConditionVery Poor 7.680e-01 1.037e-01 7.403 1.49e-13 ***
## ConditionPoor 9.317e-01 8.127e-02 11.464 < 2e-16 ***
## ConditionFair 1.104e+00 7.661e-02 14.406 < 2e-16 ***
## ConditionFair-Avg 1.144e+00 7.470e-02 15.314 < 2e-16 ***
## ConditionAverage 1.375e+00 7.352e-02 18.702 < 2e-16 ***
## ConditionAvg-Good 1.485e+00 7.346e-02 20.215 < 2e-16 ***
## ConditionGood 1.501e+00 7.346e-02 20.434 < 2e-16 ***
## ConditionGood-VG 1.542e+00 7.368e-02 20.931 < 2e-16 ***
## ConditionVery Good 1.572e+00 7.372e-02 21.318 < 2e-16 ***
## ConditionExcellent 1.669e+00 7.557e-02 22.082 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1639 on 7174 degrees of freedom
## Multiple R-squared: 0.755, Adjusted R-squared: 0.7545
## F-statistic: 1579 on 14 and 7174 DF, p-value: < 2.2e-16
plot(m4)
# Compare residuals beyond 4 SE's
r1 <- residuals(m1) / summary(m1)$sigma
sum(r1 < -4 | r1 > 4)
## [1] 33
r3 <- residuals(m3) / summary(m3)$sigma
sum(r3 < -4 | r3 > 4)
## [1] 22
r4 <- residuals(m4) / summary(m4)$sigma
sum(r4 < -4 | r4 > 4)
## [1] 5
Replacing the Violence computed by counts with
Violence_KDE computed by kernel density estimation
increases the adjusted \(R^2\) and
reduces residual standard error. From the plots, there is not a
substantial change in the residuals. Thus, we can conclude that the KDE
of violent crimes is a better predictor of property value than the count
of violent crimes. Our final model is:
\[ \log \text{Price} = \beta_0 + \beta_1 \sqrt{\text{Violence_KDE}} + \beta_2 \log(\text{Living_Area}) + \beta_3 \text{Year} + \beta_4 \text{Bath} + \beta_5 \text{Condition} \]
m5 <- lm(log(Price) ~ sqrt(Violence_KDE) + log(Living_Area) + Year +
Bath + Condition,
data = XO)
summary(m5)
##
## Call:
## lm(formula = log(Price) ~ sqrt(Violence_KDE) + log(Living_Area) +
## Year + Bath + Condition, data = XO)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.63300 -0.09047 0.01031 0.09823 1.05442
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.645e+00 1.892e-01 24.557 < 2e-16 ***
## sqrt(Violence_KDE) -1.736e+03 3.370e+01 -51.514 < 2e-16 ***
## log(Living_Area) 5.434e-01 6.587e-03 82.505 < 2e-16 ***
## Year 5.973e-04 7.919e-05 7.543 5.15e-14 ***
## Bath 7.241e-02 3.856e-03 18.778 < 2e-16 ***
## ConditionVery Poor 7.661e-01 1.020e-01 7.508 6.71e-14 ***
## ConditionPoor 9.043e-01 7.995e-02 11.312 < 2e-16 ***
## ConditionFair 1.076e+00 7.536e-02 14.282 < 2e-16 ***
## ConditionFair-Avg 1.118e+00 7.349e-02 15.216 < 2e-16 ***
## ConditionAverage 1.352e+00 7.232e-02 18.693 < 2e-16 ***
## ConditionAvg-Good 1.459e+00 7.226e-02 20.191 < 2e-16 ***
## ConditionGood 1.474e+00 7.226e-02 20.404 < 2e-16 ***
## ConditionGood-VG 1.518e+00 7.248e-02 20.941 < 2e-16 ***
## ConditionVery Good 1.548e+00 7.252e-02 21.346 < 2e-16 ***
## ConditionExcellent 1.650e+00 7.434e-02 22.192 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1612 on 7174 degrees of freedom
## Multiple R-squared: 0.763, Adjusted R-squared: 0.7625
## F-statistic: 1649 on 14 and 7174 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(m5)
Spatial regression
https://oerstatistics.wordpress.com/wp-content/uploads/2016/03/intro_to_r.pdf#page=68.08
https://crd230.github.io/lab8.html
Kernel density estimation